Data transfer
Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

Yang, Xinjun, Hu, Qingda, Li, Junru, Li, Feifei, Zhu, Yicong, Zhou, Yuqi, Lin, Qiuru, Dai, Jian, Kong, Yang, Zhang, Jiayu, Xu, Guoqiang, Liu, Qiang

arXiv.org Artificial Intelligence

The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and a 7.35x throughput improvement in the vLLM inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
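The programming-model contrast the abstract draws — native load/store access versus RDMA verbs, queue pairs, and completion polling — can be illustrated with a small analogy. The sketch below uses an anonymous `mmap` as a stand-in for a switch-attached CXL memory pool (an assumption for illustration only, not Beluga's actual interface): once the pool is mapped into the address space, producers and consumers use plain reads and writes with no explicit communication protocol.

```python
import mmap

def shared_pool_demo(size=4096):
    """Analogy for CXL load/store semantics: once a memory pool is mapped
    into the address space, access is a plain load or store, with no
    RDMA-style verbs or completion polling. An anonymous mmap stands in
    for the switch-attached CXL memory pool here."""
    pool = mmap.mmap(-1, size)   # stand-in for the mapped CXL region
    pool[0:5] = b"hello"         # a producer 'stores' bytes directly
    return bytes(pool[0:5])      # a consumer 'loads' them back
```

An RDMA-based pool would instead require registering memory regions, posting send/receive work requests, and synchronizing on completions — the complexity the shared-mapping model removes.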


CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design

Yi, Jiawei, Gong, Ping, Bai, Youhui, Ruan, Jiaqi, Wang, Shengnan, Wang, Pengcheng, Wang, Haibo, Wang, Weiguang, Zhu, Xia, Wu, Feng, Li, Cheng

arXiv.org Artificial Intelligence

The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, while further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations at the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) a seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, (3) a zero-copy transfer engine to fully exploit PCIe bandwidth, and (4) a GPU-centric synchronization method to eliminate GPU stalls. Evaluation on two widely-used LLMs demonstrates that CLO achieves comparable accuracy to state-of-the-art systems, while substantially reducing CPU overhead and fully utilizing PCIe bandwidth, thus improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at https://github.com/CommediaJW/CLO.
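The top-k attention idea underlying such offloading systems can be sketched in a few lines: score all cached keys cheaply, fetch only the top-k KV pairs (the small subset that would cross PCIe), then run exact attention on that subset. The function name and shapes below are illustrative assumptions, not CLO's actual API.

```python
import numpy as np

def topk_kv_fetch(q, k_cache, v_cache, top_k):
    """Sketch of top-k attention offloading: score every cached key with a
    cheap dot product, select the top_k most relevant KV pairs (simulating
    a CPU->GPU transfer of only that subset), then run exact softmax
    attention over the fetched entries. Illustrative only."""
    scores = k_cache @ q                             # (n_tokens,)
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of top-k keys
    k_sel, v_sel = k_cache[idx], v_cache[idx]        # "transferred" subset
    logits = k_sel @ q
    w = np.exp(logits - np.max(logits))              # stable softmax
    w /= w.sum()
    return w @ v_sel                                 # (d_head,)
```

The systems contribution is everything around this kernel: who manages the cache of hot entries, how the gather is performed without CPU copies, and how the GPU learns the transfer is done without stalling.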


TASP: Topology-aware Sequence Parallelism

Wang, Yida, Hong, Ke, Li, Xiuhong, Xu, Yuanchao, Wang, Wenxun, Dai, Guohao, Wang, Yu

arXiv.org Artificial Intelligence

Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enabling each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfers, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfers at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to a 3.58x speedup over Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
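The decomposition insight can be made concrete with a classic "stride ring" construction: the complete directed graph on n nodes splits into n-1 edge-disjoint rings, where ring k sends node i to (i+k) mod n. When gcd(k, n) = 1 that ring is a single Hamiltonian cycle, so for prime n all n-1 rings are Hamiltonian and can carry data concurrently without sharing a link. This is an illustrative construction under those assumptions, not necessarily the paper's exact decomposition.

```python
def ring_decomposition(n):
    """Decompose the complete directed graph K_n into n-1 edge-disjoint
    'stride' rings: ring k consists of the directed edges i -> (i+k) mod n.
    Every directed edge (i, j), i != j, belongs to exactly one stride class
    k = (j - i) mod n, so the rings partition all n*(n-1) edges."""
    return [[(i, (i + k) % n) for i in range(n)] for k in range(1, n)]
```

Running all rings in parallel at each AllGather iteration — instead of a single ring — is what lets an all-to-all topology's full link capacity be used.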


FastTrack: GPU-Accelerated Tracking for Visual SLAM

Khabiri, Kimia, Hosseininejad, Parsa, Gopinath, Shishir, Dantu, Karthik, Ko, Steven Y.

arXiv.org Artificial Intelligence

The tracking module of a visual-inertial SLAM system processes incoming image frames and IMU data to estimate the position of each frame in relation to the map. It is important for tracking to complete in a timely manner for each frame to avoid poor localization or tracking loss. We therefore present a new approach that leverages GPU computing power to accelerate the time-consuming components of tracking, including stereo feature matching and local map tracking, in order to improve its performance. We implement our design inside the ORB-SLAM3 tracking process using CUDA. Our evaluation demonstrates an overall improvement in tracking performance of up to 2.8x on a desktop and a Jetson Xavier NX board in stereo-inertial mode, using the well-known SLAM datasets EuRoC and TUM-VI.
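Stereo feature matching is a natural GPU target because it is a dense, data-parallel kernel: every left-image descriptor is compared against many right-image descriptors by Hamming distance. The vectorized sketch below shows the shape of that computation for ORB-style bit-packed binary descriptors; it is an illustrative stand-in, not the ORB-SLAM3/CUDA implementation.

```python
import numpy as np

def match_descriptors(desc_left, desc_right):
    """Brute-force binary descriptor matching, the data-parallel kernel a
    FastTrack-style system moves to the GPU. Each descriptor is a
    bit-packed uint8 row (e.g. 32 bytes for ORB). For every left
    descriptor, return the index of the right descriptor with the minimal
    Hamming distance."""
    # XOR every pair, then popcount: distance matrix (n_left, n_right).
    xor = desc_left[:, None, :] ^ desc_right[None, :, :]
    dists = np.unpackbits(xor, axis=2).sum(axis=2)
    return dists.argmin(axis=1)
```

On a GPU the same all-pairs XOR/popcount maps onto one thread per candidate pair, which is where the speedup over a sequential CPU loop comes from.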


An Open-Source HW-SW Co-Development Framework Enabling Efficient Multi-Accelerator Systems

Antonio, Ryan Albert, Dumoulin, Joren, Yi, Xiaoling, Van Delm, Josse, Deng, Yunhao, Paim, Guilherme, Verhelst, Marian

arXiv.org Artificial Intelligence

Heterogeneous accelerator-centric compute clusters are emerging as efficient solutions for diverse AI workloads. However, current integration strategies often compromise data movement efficiency and encounter compatibility issues in hardware and software. This prevents a unified approach that balances performance and ease of use. To this end, we present SNAX, an open-source integrated HW-SW framework enabling efficient multi-accelerator platforms through a novel hybrid-coupling scheme, consisting of loosely coupled asynchronous control and tightly coupled data access. SNAX brings reusable hardware modules designed to enhance compute accelerator utilization, and its customizable MLIR-based compiler to automate key system management tasks, jointly enabling rapid development and deployment of customized multi-accelerator compute clusters. Through extensive experimentation, we demonstrate SNAX's efficiency and flexibility in a low-power heterogeneous SoC. Accelerators can easily be integrated and programmed to achieve > 10x improvement in neural network performance compared to other accelerator systems while maintaining accelerator utilization of > 90% in full system operation.


PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices

Liu, Yangyijian, Li, Jun, Li, Wu-Jun

arXiv.org Artificial Intelligence

The high memory and computation demand of large language models (LLMs) makes them challenging to deploy on consumer devices due to limited GPU memory. Offloading can mitigate the memory constraint but often suffers from low GPU utilization, leading to low inference efficiency. In this work, we propose a novel framework, called pipelined offloading (PIPO), for efficient inference on consumer devices. PIPO designs a fine-grained offloading pipeline, complemented with optimized data transfer and computation, to achieve high concurrency and efficient scheduling for inference. Experimental results show that, compared with a state-of-the-art baseline, PIPO increases GPU utilization from below 40% to over 90% and achieves up to 3.1$\times$ higher throughput, running on a laptop equipped with an RTX 3060 GPU with 6GB of memory.
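The core of any offloading pipeline is overlapping weight transfer with computation: while layer i runs, the weights for layer i+1 are already in flight. The sketch below shows that double-buffering pattern with a background thread; the callables `load` and `compute` are placeholders for a host-to-GPU transfer and a layer forward pass, and the whole function is an illustrative sketch rather than PIPO's actual scheduler.

```python
import threading

def pipelined_inference(layers, x, load, compute):
    """Minimal sketch of pipelined offloading: while layer i computes, a
    background thread prefetches layer i+1's weights, overlapping
    PCIe-style transfer with GPU-style compute."""
    weights = load(layers[0])                 # bring in the first layer
    for i in range(len(layers)):
        prefetched, t = {}, None
        if i + 1 < len(layers):               # start the next transfer early
            t = threading.Thread(
                target=lambda nxt=layers[i + 1]: prefetched.update(w=load(nxt)))
            t.start()
        x = compute(weights, x)               # compute overlaps the transfer
        if t:
            t.join()                          # wait only if transfer is slower
            weights = prefetched["w"]
    return x
```

When per-layer compute time exceeds transfer time, the `join` returns immediately and the transfers are fully hidden; PIPO's contribution is making the pipeline fine-grained enough that this holds on consumer hardware.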


Transfer data from your Android phone to your Windows PC: The ultimate guide

PCWorld

Nowadays, a smartphone replaces the (video) camera on holiday, acts as a portable music player, stores all your WhatsApp media, and holds audio plays, e-books, and documents. To avoid losing such data, you should create regular backups, and your home Windows PC is ideal for this. The home computer is also a good data source, as it often houses downloads, music libraries, and video archives. However, if you want to transfer music, videos, or images between your smartphone and a Windows PC, you are spoiled for choice: there is a whole range of different methods available for this data transfer. The simplest and quickest method of connecting an Android device to your Windows PC is the classic USB cable.


Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study

Zhu, Jianwei, Yin, Hang, Deng, Peng, Zhou, Shunfan

arXiv.org Artificial Intelligence

This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various LLMs and token lengths, with a particular focus on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results indicate that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily attributable to data transfer. For the majority of typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing nearly zero overhead.


Employing Artificial Intelligence to Steer Exascale Workflows with Colmena

Ward, Logan, Pauloski, J. Gregory, Hayot-Sasson, Valerie, Babuji, Yadu, Brace, Alexander, Chard, Ryan, Chard, Kyle, Thakur, Rajeev, Foster, Ian

arXiv.org Artificial Intelligence

Computational workflows are a common class of application on supercomputers, yet their loosely coupled and heterogeneous nature often prevents them from taking full advantage of a supercomputer's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.


Unlocking the Potential of Binding Corporate Rules (BCRs) in Health Data Transfers

Compagnucci, Marcelo Corrales, Fenwick, Mark, Haapio, Helena

arXiv.org Artificial Intelligence

This chapter explores the essential role of Binding Corporate Rules (BCRs) in managing and facilitating secure health data transfers within corporate groups under the EU General Data Protection Regulation (GDPR). BCRs are tailored to ensure compliance with the GDPR and similar international data protection laws, presenting a flexible mechanism for transferring sensitive health and genomic data. The chapter situates BCRs within the broader spectrum of the GDPR international data transfer mechanisms, addressing the unique challenges posed by the sensitive nature of health data and the increased adoption of AI technologies. The European Data Protection Board (EDPB) Recommendations 1/2022 on BCRs, issued following the Schrems II decision, are critically analyzed, highlighting their stringent requirements and the need for a balanced approach that prioritizes data protection and an AI governance framework. The chapter outlines the BCR approval process, stressing the importance of streamlining this process to encourage broader adoption. It underscores the necessity of a multidisciplinary approach in developing BCRs, incorporating recently adopted international standards and frameworks, which offer valuable guidance for organizations to build trustworthy AI management systems. They guarantee the ethical development, deployment, and operation of AI, which is essential for its successful integration and the broader digital transformation. In conclusion, BCRs are positioned as essential tools for secure health data management, fostering transparency, accountability, and collaboration across international borders. The chapter calls for proactive measures to incentivize BCR adoption, streamline approval processes, and promote more innovative approaches, ensuring BCRs remain a robust mechanism for global data protection and compliance.